1 Introduction

Among all cancer types, lung cancer is the most lethal among both males and females. It is responsible for 28% of cancer-related mortality. The life expectation for those who have lung cancer are very poor. It is usually diagnosed in an advanced phase, and the patients only have a 16% survival rate after 5 years of their diagnosis.

The Cancer Genome Atlas (TCGA) has comprehensively profiled this type of cancer in a patient cohort. Here, we analyze the expression profiles of those patients, using a pipeline based on the R/Bioconductor software package Rsubread.

This document is written in R markdown and should be processed using R and you need to install the packages knitr and markdown. Moreover, it using the official style for Bioconductor vignettes facilitated by the Bioconductor package BiocStyle. It is compiled from different files following the instructions included in the makefile.

2 Quality assessment

2.1 Data import

Lung cancer, specifically lung squamous cell carcinoma, was assessed and analyzed in this experiment. We start importing the raw table of counts.

library(SummarizedExperiment)

se.unpaired <- readRDS(file.path("rawCounts", "seLUSC.rds"))
se.unpaired
class: RangedSummarizedExperiment 
dim: 20115 553 
metadata(5): experimentData annotation cancerTypeCode
  cancerTypeDescription objectCreationDate
assays(1): counts
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(553): TCGA.18.3406.01A.01R.0980.07
  TCGA.18.3407.01A.01R.0980.07 ... TCGA.90.7767.11A.01R.2125.07
  TCGA.92.7340.11A.01R.2045.07
colData names(549): type bcr_patient_uuid ...
  lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total
table(se.unpaired$type)

normal  tumor 
    51    502 

From this, we see that 20115 genes from a total of 553 samples are in this data. We can also observe that this dataset contains 502 tumor samples and 51 normal samples, which matches the information of the dataset from The Cancer Genome Atlas (TCGA) program.

Next, we explored the column (phenotypic) data, which in this case corresponds to clinical variables, and their corresponding metadata. For example, the column “type” consists of tumor and normal labels. Other information such as gender, tumor stage, and smoking status are also shown.

In any case, we can access all the possible labels with the last line of code, which returns two columns of information about the clinical variables. One called labelDescription contains a succint description of the variable, often not more self-explanatory than the variable name itself, and the other called ‘CDEID’ corresponds to the so-called Common Data Element (CDE) identifier. This identifier can be use in https://cdebrowser.nci.nih.gov to search for further information about the associated clinical variable using the Advanced search form and the Public ID attribute search.

dim(colData(se.unpaired))
[1] 553 549
colData(se.unpaired)[1:5, 1:5]
DataFrame with 5 rows and 5 columns
                                 type                     bcr_patient_uuid
                             <factor>                             <factor>
TCGA.18.3406.01A.01R.0980.07    tumor 95b83006-02c9-4c4d-bf84-a45115f7d86d
TCGA.18.3407.01A.01R.0980.07    tumor 4e1ad82e-23c8-44bb-b74e-a3d0b1126b96
TCGA.18.3408.01A.01R.0980.07    tumor d4bc755a-2585-4529-ae36-7e1d88bdecfe
TCGA.18.3409.01A.01R.0980.07    tumor b09e872a-e837-49ec-8a27-84dcdcabf347
TCGA.18.3410.01A.01R.0980.07    tumor 99599b60-4f5c-456b-8755-371b1aa7074e
                             bcr_patient_barcode form_completion_date
                                        <factor>             <factor>
TCGA.18.3406.01A.01R.0980.07        TCGA-18-3406             2011-3-9
TCGA.18.3407.01A.01R.0980.07        TCGA-18-3407             2011-3-9
TCGA.18.3408.01A.01R.0980.07        TCGA-18-3408             2011-3-9
TCGA.18.3409.01A.01R.0980.07        TCGA-18-3409            2011-3-17
TCGA.18.3410.01A.01R.0980.07        TCGA-18-3410             2011-4-4
                             prospective_collection
                                           <factor>
TCGA.18.3406.01A.01R.0980.07                     NO
TCGA.18.3407.01A.01R.0980.07                     NO
TCGA.18.3408.01A.01R.0980.07                     NO
TCGA.18.3409.01A.01R.0980.07                     NO
TCGA.18.3410.01A.01R.0980.07                     NO
mcols(colData(se.unpaired), use.names=TRUE)
DataFrame with 549 rows and 2 columns
                                                         labelDescription
                                                              <character>
type                                           sample type (tumor/normal)
bcr_patient_uuid                                         bcr patient uuid
bcr_patient_barcode                                   bcr patient barcode
form_completion_date                                 form completion date
prospective_collection            tissue prospective collection indicator
...                                                                   ...
lymph_nodes_pelvic_pos_total                               total pelv lnp
lymph_nodes_aortic_examined_count                           total aor lnr
lymph_nodes_aortic_pos_by_he                          aln pos light micro
lymph_nodes_aortic_pos_by_ihc                                 aln pos ihc
lymph_nodes_aortic_pos_total                                total aor-lnp
                                        CDEID
                                  <character>
type                                       NA
bcr_patient_uuid                           NA
bcr_patient_barcode                   2673794
form_completion_date                       NA
prospective_collection                3088492
...                                       ...
lymph_nodes_pelvic_pos_total          3151828
lymph_nodes_aortic_examined_count     3104460
lymph_nodes_aortic_pos_by_he          3151832
lymph_nodes_aortic_pos_by_ihc         3151831
lymph_nodes_aortic_pos_total          3151827

Now, we explore the row (feature) data, which provide information on genes. For the first line of command, we can see the length of each gene and the GC content. The second line of command provides further information, such as the ranges within individual chromosome.

rowData(se.unpaired)
DataFrame with 20115 rows and 3 columns
               symbol     txlen              txgc
          <character> <integer>         <numeric>
1                A1BG      3322 0.564419024683925
2                 A2M      4844 0.488232865400495
9                NAT1      2280 0.394298245614035
10               NAT2      1322 0.389561270801815
12           SERPINA3      3067 0.524942940984676
...               ...       ...               ...
100996331       POTEB      1706 0.430832356389215
101340251    SNORD124       104 0.490384615384615
101340252   SNORD121B        81 0.407407407407407
102724473      GAGE10       538 0.505576208178439
103091865   BRWD1-IT2      1028 0.592412451361868
rowRanges(se.unpaired)
GRanges object with 20115 ranges and 3 metadata columns:
            seqnames            ranges strand |      symbol     txlen
               <Rle>         <IRanges>  <Rle> | <character> <integer>
          1    chr19 58345178-58362751      - |        A1BG      3322
          2    chr12   9067664-9116229      - |         A2M      4844
          9     chr8 18170477-18223689      + |        NAT1      2280
         10     chr8 18391245-18401218      + |        NAT2      1322
         12    chr14 94592058-94624646      + |    SERPINA3      3067
        ...      ...               ...    ... .         ...       ...
  100996331    chr15 20835372-21877298      - |       POTEB      1706
  101340251    chr17 40027542-40027645      - |    SNORD124       104
  101340252     chr9 33934296-33934376      - |   SNORD121B        81
  102724473     chrX 49303669-49319844      + |      GAGE10       538
  103091865    chr21 39313935-39314962      + |   BRWD1-IT2      1028
                         txgc
                    <numeric>
          1 0.564419024683925
          2 0.488232865400495
          9 0.394298245614035
         10 0.389561270801815
         12 0.524942940984676
        ...               ...
  100996331 0.430832356389215
  101340251 0.490384615384615
  101340252 0.407407407407407
  102724473 0.505576208178439
  103091865 0.592412451361868
  -------
  seqinfo: 455 sequences (1 circular) from hg38 genome

In this analysis, we wanted to compare tumor and normal cells. In order to do this, we wanted to compare tumor and normal cells that come from the same patient. This also allows to pair the samples which reduces the possibility of false positives. In order to get them, we execute the following code.

# get the number of occurences of each patient
occur <- data.frame(table(substr(colnames(se.unpaired), 9, 12)))

# get those that occur twice
paired_df <- occur[occur$Freq > 1,]
paired <- as.vector(paired_df$Var1)
paired <- paired[2:length(paired)]

mask <- is.element(substr(colnames(se.unpaired), 9, 12), paired)

se <- se.unpaired[ ,mask]

saveRDS(se, file.path("results", "se.paired.rds"))

Since this data is unprocessed, we must first perform a quality assessment and normalize accordingly. To do this, we need first to load the edgeR R/Bioconductor package. This requires the creation of a `DGEList’ object, as the package doesn’t work directly with SummarizedExperiment objects. In any case, we will update any change in the DGElist object to the SummarizedExperiment object.

library(edgeR)

dge <- DGEList(counts=assays(se)$counts, genes=mcols(se))
saveRDS(dge, file.path("results", "dge.paired.rds"))

One way of normalizing is to use the counts per million (CPM) values. This value is calculated by dividing the number of counts of each sample by 1 million. Instead of using this directly, we calculate \(\log_2\) CPM values of expression because it has nicer distributional properties than raw counts or non-logged CPM units. We set this information as an additional assay element to ease their manipulation.

assays(se)$logCPM <- cpm(dge, log=TRUE, prior.count=0.5)
assays(se)$logCPM[1:5, 1:5]
   TCGA.22.4593.01A.21R.1820.07 TCGA.22.4609.01A.21R.2125.07
1                      2.433139                     1.760377
2                      9.013463                    10.810625
9                     -6.817220                    -6.817220
10                    -6.817220                    -6.817220
12                     5.605909                     5.647356
   TCGA.22.5471.01A.01R.1635.07 TCGA.22.5472.01A.01R.1635.07
1                      2.122666                     0.775287
2                      7.713750                    11.100735
9                     -6.817220                    -6.817220
10                    -6.817220                    -6.817220
12                     4.823947                     5.856939
   TCGA.22.5478.01A.01R.1635.07
1                      3.713844
2                      9.448174
9                     -6.817220
10                    -6.817220
12                     4.920820

2.2 Sequencing depth

Let’s examine the sequencing depth in terms of total number of sequence read counts mapped to the genome per sample. Figure 1 below shows the sequencing depth per sample, also known as library sizes, in increasing order.

Library sizes in increasing order.

Figure 1: Library sizes in increasing order

This figure reveals no big changes in sequencing depth between samples. In addition, we see a uniform distribution of tumor and normal samples across the figure. We see a cluster of normal samples on the right side with a higher cpm, but don’t believe this will affect our analyses. We can obtain the samples with less sequencing depth using the commands below.

sampledepth <- round(dge$sample$lib.size / 1e6, digits=1)
names(sampledepth) <- substr(colnames(se), 6, 12)
sort(sampledepth)
43.6773 56.7823 77.7335 56.8309 56.8083 58.8386 77.7335 43.6773 56.7582 
   28.8    29.0    31.0    32.0    32.9    33.4    34.9    35.0    35.9 
56.8201 56.8201 56.8623 34.8454 77.7142 56.7579 56.7730 56.8083 56.8082 
   36.5    37.0    37.0    37.6    38.7    39.7    40.2    40.3    40.4 
56.8082 51.4081 34.8454 22.5481 56.8309 77.7337 33.4587 56.7823 56.7580 
   40.7    41.9    42.1    42.3    43.6    44.0    45.6    45.7    45.8 
56.7579 85.7710 58.8386 34.7107 56.7731 22.5471 77.7142 56.7580 56.7731 
   45.9    46.5    46.6    47.3    47.6    47.8    48.4    49.7    50.0 
85.7710 43.3394 39.5040 22.5472 60.2709 51.4079 90.6837 51.4080 77.7338 
   50.1    50.2    51.0    51.1    51.2    51.5    51.6    51.9    52.0 
56.7222 77.7338 77.7138 56.7582 22.5471 43.5670 22.4609 22.5483 92.7340 
   52.4    52.4    52.8    52.9    53.6    53.6    54.2    54.4    54.6 
77.8008 34.7107 90.7767 43.7657 22.4609 43.7657 77.7138 33.6737 43.7658 
   54.9    55.1    55.7    55.8    56.3    56.7    57.0    57.2    57.4 
77.7337 77.8007 43.5670 56.7222 56.8623 33.4587 43.7658 43.6143 22.5481 
   57.6    57.6    58.0    58.0    59.1    59.6    59.9    60.0    60.3 
22.5491 43.6647 92.7340 39.5040 22.5489 22.5482 77.8008 22.5489 22.5491 
   61.0    61.3    61.3    61.5    63.5    64.5    65.2    65.5    66.0 
77.8007 22.5478 90.6837 90.7767 56.7730 33.6737 60.2709 22.5478 43.6143 
   66.4    68.0    68.2    68.4    68.8    69.3    69.8    71.0    75.9 
22.5472 22.5482 22.4593 43.6771 22.5483 43.6647 22.4593 43.6771 51.4081 
   76.2    77.0    77.9    79.4    80.3    81.5    85.0    88.8   103.7 
51.4080 51.4079 43.3394 
  119.5   122.6   124.1 

2.3 Distribution of expression levels among samples

Next, we will look at the distribution of expression values per sample in terms of logarithmic CPM units. We display tumor and normal samples separately, and are shown in Figure 2. Each colored line in the graphs below represent a sample. In both graphs, we see two modes. The low mode represents genes not expressed in that sample. The second mode represents the genes that are expressed.

Non-parametric density distribution of expression profiles per sample.

Figure 2: Non-parametric density distribution of expression profiles per sample

In both graphs, we see some genes that deviate a little from the rest. These could be potential outliers and we have noted them for later analysis.

2.4 Distribution of expression levels among genes

Next, we wanted to look at the distribution of expression levels across genes. To do this, we calculate the average expression per gene through all the samples. Figure 3 shows the distribution of those values.

Distribution of average expression level per gene.

Figure 3: Distribution of average expression level per gene

2.5 Filtering of lowly-expressed genes

RNA sequence expression profiles that come from lowly-expressed genes can lead to artifacts in downstream differential expression analyses. Thus, it is common to set a threshold and filter out genes that fall below this value. In the light of this plot above, we may consider a cutoff of 1 log CPM unit as minimum value of expression to select genes being expressed across samples. Using this cutoff we proceed to filter out lowly-expressed genes. Now, we have a total of 11872 genes and 102 samples.

mask <- avgexp > 1
dim(se)
[1] 20115   102
se.filt <- se[mask, ]
dim(se.filt)
[1] 11872   102
dge.filt <- dge[mask, ]
dim(dge.filt)
[1] 11872   102

We then store the un-normalized versions of the filtered expression data.

saveRDS(se.filt, file.path("results", "se.paired.filt.unnorm.rds"))
saveRDS(dge.filt, file.path("results", "dge.paired.filt.unnorm.rds"))

2.6 Normalization

Next, we calculate the normalization factors on the filtered expression data set using the calcNormFactors function.

dge.filt <- calcNormFactors(dge.filt)

Replace the raw log2 CPM units in the corresponding assay element of the SummarizedExperiment object, by the normalized ones.

assays(se.filt)$logCPM <- cpm(dge.filt, log=TRUE, normalized.lib.sizes=TRUE, prior.count=0.25)

Store normalized versions of the filtered expression data.

saveRDS(se.filt, file.path("results", "se.paired.filt.rds"))
saveRDS(dge.filt, file.path("results", "dge.paired.filt.rds"))

2.7 MA-plots

We examine now the MA-plots of the normalized expression profiles. These plots allow us to visualize how one sample compares to the average of the rest of the samples. We look first to the tumor samples in Figure 4.

MA-plots of the tumor samples.

Figure 4: MA-plots of the tumor samples

In the MA-plots of the tumor samples, we observed the following:

  • 3 samples with high differences from the mean, with a threshold ( M > 2 or M < -2). Their MA plots have the regression line in red.
  • 14 samples with moderate differences from the mean, which we define as ( M <= 2 and M >= 1 or M >=-2 and M <= -1). Their regression line is brown.
  • 34 samples with discrete differences from the mean as they are (M < 1 or M > -1). Their regression line is green.

We observe that the appearance of samples that differ from the mean can be problematic. If during downstream analysis we observe unexpected results, those samples should be removed together with their normal pairs. We can obtain their names with the following code:

big_c
[1] "TCGA.56.7823" "TCGA.56.8623" "TCGA.77.7138"

Next, we look now to the normal samples in Figure 5.

MA-plots of the normal samples.

Figure 5: MA-plots of the normal samples

In this case, we observed:

  • 0 samples with high differences from the mean, with a threshold ( M > 2 or M < -2). Their MA plots have the regression line in red.
  • 0 samples with moderate differences from the mean, which we define as ( M <= 2 and M >= 1 or M >=-2 and M <= -1). Their regression line is brown.
  • 51 samples with discrete differences from the mean as they are (M < 1 or M > -1). Their regression line is green.

In conclusion, we do not observe either important expression-level dependent biases among the normal samples.

2.8 Batch identification

Batch effects are sub-groups of measurements that have a qualitatively different behavior across conditions and are unrelated to the variables in the study. It is important to determine if batch effects are present to know if any confounding variables are affecting the data. We will search now for potential surrogate of batch effect indicators within normal and tumor samples. Given that each sample names corresponds to a TCGA barcode (see https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode), following the strategy described in http://bioinformatics.mdanderson.org/main/TCGABatchEffects:Overview we are going to derive different elements of the TCGA barcode and examine their distribution across samples.

tss <- substr(colnames(se.filt), 6, 7)
table(data.frame(TYPE=se.filt$type, TSS=tss))
        TSS
TYPE     22 33 34 39 43 51 56 58 60 77 85 90 92
  normal 10  2  2  1  8  3 12  1  1  7  1  2  1
  tumor  10  2  2  1  8  3 12  1  1  7  1  2  1
center <- substr(colnames(se.filt), 27, 28)
table(data.frame(TYPE=se.filt$type, CENTER=center))
        CENTER
TYPE     07
  normal 51
  tumor  51
plate <- substr(colnames(se.filt), 22, 25)
table(data.frame(TYPE=se.filt$type, PLATE=plate))
        PLATE
TYPE     0980 1100 1635 1758 1820 1858 1949 2045 2125 2187 2247 2296 2326
  normal    0    0    5    4    7    1    4   10   10    2    4    2    1
  tumor     1    3    6    0    7    0    4   10   10    2    4    2    1
        PLATE
TYPE     A28V
  normal    1
  tumor     1
portionanalyte <- substr(colnames(se.filt), 18, 20)
table(data.frame(TYPE=se.filt$type, PORTIONANALYTE=portionanalyte))
        PORTIONANALYTE
TYPE     01R 11R 21R 31R 41R
  normal  48   3   0   0   0
  tumor   11  28   8   2   2
samplevial <- substr(colnames(se.filt), 14, 16)
table(data.frame(TYPE=se.filt$type, SAMPLEVIAL=samplevial))
        SAMPLEVIAL
TYPE     01A 01B 11A
  normal   0   0  51
  tumor   50   1   0

From this information we can make the following observations:

  • All samples were sequenced at the same center
  • Tumor and normal samples were collected in the same vial except one:
colnames(se.filt)[samplevial == "01B"]
[1] "TCGA.56.7823.01B.11R.2247.07"
  • Samples were collected evenly across different tissue source sites (TSS).
  • Samples were mostly well distributed in different plates.
  • Samples of normal cells concentrate into two different portion of analyte, while the tumor samples have five different portions.

With those results, we use the portion of analyte as surrogate of batch effect indicator. Considering our outcome of interest as molecular changes between sample types, tumor vs. normal, we will examine now the cross-classification of this outcome with portion of analyte.

table(data.frame(TYPE=se.filt$type, PORTIONANALYTE=portionanalyte))
        PORTIONANALYTE
TYPE     01R 11R 21R 31R 41R
  normal  48   3   0   0   0
  tumor   11  28   8   2   2

As we can see here, the majority of normal samples (48) come from the “01R”" analyte and 3 come from the “11R” analyte. However, the tumor samples span across five different analytes. This could potentially lead to a batch effect.

In order to examine how the samples group together, we use two approaches: hierarchical clustering and multidimensional scaling, annotating the outcome of interest and the surrogate of batch indicator, or portion of analyte in our case.

We perform this under a subset of our samples, as we wish to have a matrix without zeros. We remove samples belonging to portions 21R, 31R and 41R; and as the experiment is paired, also their pairs.

por1 <- as.vector(substr(colnames(se.filt)[portionanalyte == "41R"], 9, 12))
por2 <- as.vector(substr(colnames(se.filt)[portionanalyte == "31R"], 9, 12))
por3 <- as.vector(substr(colnames(se.filt)[portionanalyte == "21R"], 9, 12))
allpor <- c(por1, por2, por3)

table(is.element(substr(colnames(se.filt), 9, 12), allpor))

FALSE  TRUE 
   78    24 
se.batch <- se.filt[ ,!is.element(substr(colnames(se.filt), 9, 12), allpor)]
dge.batch <- DGEList(counts=assays(se.batch)$counts, genes=mcols(se.batch))

new.portionanalyte <- substr(colnames(se.batch), 18, 20)
table(data.frame(TYPE=se.batch$type, PORTIONANALYTE=new.portionanalyte))
        PORTIONANALYTE
TYPE     01R 11R
  normal  36   3
  tumor   11  28
dim(dge.batch)
[1] 11872    78

We see that we are able to remove 24 samples, which correspond to the 12 in the portions we want to remove and their pairs.

We calculate again log CPM values with a higher prior count to moderate extreme fold-changes produced by low counts. The resulting dendrogram is shown in Figure 6.

logCPM <- cpm(dge.batch, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(new.portionanalyte))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(se.batch)
outcome <- paste(substr(colnames(se.batch), 9, 12), as.character(se.batch$type), sep="-")
names(outcome) <- colnames(se.batch)
sampleDendrogram <- dendrapply(sampleDendrogram,
                               function(x, batch, labels) {
                                 if (is.leaf(x)) {
                                   attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
                                   attr(x, "label") <- as.vector(labels[attr(x, "label")])
                                 }
                                 x
                               }, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))), fill=sort(unique(batch)))
Hierarchical clustering of the samples.

Figure 6: Hierarchical clustering of the samples

We can observe that samples cluster primarily by sample type, tumor or normal. We observe that the different portions are present in both clusters, so we don’t consider that it is inducing batch effect.

We also see that one of the tumor samples, 8623-tumor, is present in the normal samples cluster. This one also had a strong deviation in the MA plot, so we might consider discarding it.

In Figure 7 we show the corresponding MDS plot. Here we see more clearly that the first source of variation separates tumor from normal samples, and we don’t see any sample that is separated from any cluster.

plotMDS(dge.batch, labels=outcome, col=batch)
legend("bottomleft", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))),
       fill=sort(unique(batch)), inset=0.05)
Multidimensional scaling plot of the samples.

Figure 7: Multidimensional scaling plot of the samples

Finally, under the consideration that there is no batch effect, the 24 samples that were removed in order to make the dendogram and the MDS plot will remain for further analysis. We decide to generate the dendogram in Figure 8 again in order to see if any other sample appears in its opposite cluster, apart from 8623-tumor.

logCPM <- cpm(dge.filt, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(portionanalyte))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(se.filt)
outcome <- paste(substr(colnames(se.filt), 9, 12), as.character(se.filt$type), sep="-")
names(outcome) <- colnames(se.filt)
sampleDendrogram <- dendrapply(sampleDendrogram,
                               function(x, batch, labels) {
                                 if (is.leaf(x)) {
                                   attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
                                   attr(x, "label") <- as.vector(labels[attr(x, "label")])
                                 }
                                 x
                               }, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))), fill=sort(unique(batch)))
Hierarchical clustering of the samples.

Figure 8: Hierarchical clustering of the samples

We observe that only the 8623-tumor sample doesn’t cluster with the rest of the tumor samples.

Sample removal:

dim(se.filt)
[1] 11872   102
dim(dge.filt)
[1] 11872   102
number <- as.character(8623)

se.filt <- se.filt[ ,!is.element(substr(colnames(se.filt), 9, 12), number)]
dge.filt <- DGEList(counts=assays(se.filt)$counts, genes=mcols(se.filt))

dim(se.filt)
[1] 11872   100
dim(dge.filt)
[1] 11872   100

2.9 Session information

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] ca_ES.UTF-8/ca_ES.UTF-8/ca_ES.UTF-8/C/ca_ES.UTF-8/ca_ES.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] geneplotter_1.60.0          annotate_1.60.1            
 [3] XML_3.98-1.20               AnnotationDbi_1.44.0       
 [5] lattice_0.20-38             edgeR_3.24.3               
 [7] limma_3.38.3                SummarizedExperiment_1.12.0
 [9] DelayedArray_0.8.0          BiocParallel_1.16.6        
[11] matrixStats_0.54.0          Biobase_2.42.0             
[13] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
[15] IRanges_2.16.0              S4Vectors_0.20.1           
[17] BiocGenerics_0.28.0         knitr_1.23                 
[19] BiocStyle_2.10.0           

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1             RColorBrewer_1.1-2     compiler_3.5.3        
 [4] BiocManager_1.30.4     highr_0.8              XVector_0.22.0        
 [7] bitops_1.0-6           tools_3.5.3            zlibbioc_1.28.0       
[10] bit_1.1-14             digest_0.6.19          memoise_1.1.0         
[13] RSQLite_2.1.1          evaluate_0.14          Matrix_1.2-17         
[16] DBI_1.0.0              yaml_2.2.0             xfun_0.7              
[19] GenomeInfoDbData_1.2.0 stringr_1.4.0          bit64_0.9-7           
[22] locfit_1.5-9.1         grid_3.5.3             rmarkdown_1.13        
[25] bookdown_0.11          blob_1.1.1             magrittr_1.5          
[28] codetools_0.2-16       htmltools_0.3.6        xtable_1.8-4          
[31] KernSmooth_2.23-15     stringi_1.4.3          RCurl_1.95-4.12       

3 Differential expression analysis

Differential expression analysis consists of identifying changes in gene expression. For RNAseq experiments, this means comparing relative concentrations of RNA molecules. In R, this can be done by using the limma and sva R/Bioconductor packages.

3.1 Analysis

limma allows for linear model analysis of data for DE. This requires building two model matrices: a matrix of the model including the outcome of interest and adjustment variables, and a matrix of the null model that only includes adjustment variables. The outcome of interest is the type of sample (tumor vs normal), and as both types of samples come from the same patient the analysis is paired, so a unique identifier for each patient is included. Finally, as there are unknown factors to adjust, an intercept term of 1 is given to the null model.

Even if no known sources of variation were found, surrogate variables representing unknown sources of variation can be added to the model. Those can be estimated with the sva package, and later be appended to the model. Finally, voom is used to fit the model as the library size distribution is not uniform across samples, and the moderated t-statistics are calculated.

library(sva)
# Obtain the patient ID, as the analysis is paired
patientid <- substr(colnames(se.filt), 9, 12)
samplevial <- substr(colnames(se.filt), 14, 16)

# Build the model (mod) and null (mod0) matrices 
mod <- model.matrix(~type + patientid, data = colData(se.filt))
mod0 <- model.matrix(~patientid, data = colData(se.filt))

# Estimate surrogate variables
sv <- sva(assays(se.filt)$logCPM, mod = mod, mod0 = mod0)
Number of significant surrogate variables is:  11 
Iteration (out of 5 ):1  2  3  4  5  
# Add surrogates to the design matrix
len <- length(colnames(mod))
mod <- cbind(mod, sv$sv)
colnames(mod) <- c(colnames(mod)[1:len], paste0("SV", 1:sv$n))
v <- voom(dge.filt, mod, plot=TRUE)
Mean-variance plot for DE analysis.

Figure 9: Mean-variance plot for DE analysis

fit <- lmFit(v, mod) # Fit the model with the voom weights
fit <- eBayes(fit) # Calculate moderated t-statistics

sva was able to estimate 11 surrogate variables. Regarding the mean-variance plot shown in Figure ??, which was obtained with voom, it shows the variance in the y axis and the mean of expression in the x axis for each gene, represented as a black dot. Such plots contain a red curve that is fitted to the dots and show trends in the data. Normally increases in expression levels mean a decrease in the variance, until a plateau is reached for highly expressed genes. This trend is observed in the plot obtained for this data. Moreover, the plot is also a diagnosis of filtering steps performed upstream. The fact that there is not a drop in variance for the lower expression values indicates that the filtering threshold was good.

As we are performing multiple tests, the probability of type I errors (rejecting the null hypotheses when it is true) increases. There are multiple statistical strategies to decrease the likelihood of these errors, such as the false discovery rate (FDR). This technique consists of defining an acceptable rate of type I errors (false discoveries). Here, a FDR of 0.01 is set, which means that 1% of the rejections of the null hypothesis will be incorrect. For our data set, which has 11872 observations, up to 119 genes can be false positives (FP).

FDRcutoff <- 0.01
res <- decideTests(fit, p.value = FDRcutoff)
summary(res)
       (Intercept) typetumor patientid3394 patientid4079 patientid4080
Down             4      4089             0            16             0
NotSig         768      2682         11870         11853         11871
Up           11100      5101             2             3             1
       patientid4081 patientid4587 patientid4593 patientid4609 patientid5040
Down               1             4             1             0             0
NotSig         11870         11867         11869         11871         11870
Up                 1             1             2             1             2
       patientid5471 patientid5472 patientid5478 patientid5481 patientid5482
Down               0             0             0             5             1
NotSig         11871         11870         11871         11866         11870
Up                 1             2             1             1             1
       patientid5483 patientid5489 patientid5491 patientid5670 patientid6143
Down               0             1             0             0             1
NotSig         11871         11869         11870         11871         11869
Up                 1             2             2             1             2
       patientid6647 patientid6737 patientid6771 patientid6773 patientid6837
Down               4             1             1             1             0
NotSig         11867         11867         11869         11870         11871
Up                 1             4             2             1             1
       patientid7107 patientid7138 patientid7142 patientid7222 patientid7335
Down               1             1             5             0             4
NotSig         11870         11871         11865         11871         11866
Up                 1             0             2             1             2
       patientid7337 patientid7338 patientid7340 patientid7579 patientid7580
Down               0             1             6             0             1
NotSig         11870         11870         11862         11870         11870
Up                 2             1             4             2             1
       patientid7582 patientid7657 patientid7658 patientid7710 patientid7730
Down               1             4             4            10             1
NotSig         11870         11866         11868         11860         11869
Up                 1             2             0             2             2
       patientid7731 patientid7767 patientid7823 patientid8007 patientid8008
Down               5             0             5             5             9
NotSig         11864         11870         11866         11866         11862
Up                 3             2             1             1             1
       patientid8082 patientid8083 patientid8201 patientid8309 patientid8386
Down               6             0             0             0             0
NotSig         11865         11870         11871         11872         11871
Up                 1             2             1             0             1
       patientid8454   SV1   SV2   SV3   SV4   SV5   SV6   SV7   SV8   SV9
Down               7  1277   317   603   593    62    34    27    13   202
NotSig         11864  8901  9085 10839 10614 11547 11592 11811 11849 11565
Up                 1  1694  2470   430   665   263   246    34    10   105
        SV10  SV11
Down      65     0
NotSig 11689 11871
Up       118     1

The summary table displays the upregulated, not significant, and downregulated genes for each variable in the model.

Gene metadata such as the gene symbol and the chromosome can be added for an easier interpretation of the obtained statistics, and to observe the chromosome distribution of the differentially expressed genes.

# Get gene info...
genesmd <- data.frame(chr = as.character(seqnames(rowRanges(se.filt))), symbol = rowData(se.filt)[, 1], stringsAsFactors = FALSE)

# and add it to the results
fit$genes <- genesmd
tt <- topTable(fit, coef = 2, n = Inf)
head(tt, n = 20)
            chr  symbol     logFC  AveExpr         t      P.Value
171482     chrX   FAM9A  6.766405 1.477480  36.85864 6.843818e-34
2261       chr4   FGFR3  4.348392 4.461243  34.53058 1.003148e-32
51569     chr13    UFM1  5.802336 3.035598  33.92022 2.085602e-32
9263       chr7  STK17A  5.509825 7.124666  32.45828 1.266931e-31
51804     chr14    SIX4  3.296163 4.291469  32.44698 1.285100e-31
23217     chr19    ZFR2  5.120053 6.564764  32.37119 1.414022e-31
54536     chr10   EXOC6  4.766545 2.127249  31.64391 3.576728e-31
100616453 chr11 MIR4687  3.527821 5.839996  31.46343 4.516773e-31
692205     chr2 SNORD89  4.647090 4.105932  31.10766 7.180600e-31
27316      chrX    RBMX  4.427552 3.627725  31.06818 7.561981e-31
6406      chr20   SEMG1  2.555202 4.942304  30.96833 8.621594e-31
2079      chr14     ERH  4.226847 1.320862  31.05712 7.672451e-31
8317       chr1    CDC7  4.744208 2.756026  30.62319 1.360661e-30
117156     chr5 SCGB3A2  4.486026 6.802293  30.38548 1.868232e-30
2149       chr5     F2R  3.206808 3.639705  30.25463 2.226567e-30
158062     chr9    LCN6  8.017857 6.999330  30.16581 2.509203e-30
10643      chr7 IGF2BP3 -4.052705 4.356854 -29.91757 3.510205e-30
11252     chr22 PACSIN2  4.913960 1.672498  29.82982 3.954882e-30
121391    chr12   KRT74  3.939914 3.777500  29.70032 4.718864e-30
5782       chr7  PTPN12  3.323264 8.759576  29.68945 4.789500e-30
             adj.P.Val        B
171482    8.124980e-30 66.95030
2261      5.954687e-29 64.56615
51569     8.253424e-29 63.71946
9263      2.797878e-28 62.09967
51804     2.797878e-28 62.06905
23217     2.797878e-28 61.99049
54536     6.066132e-28 60.90182
100616453 6.702891e-28 60.82636
692205    8.280668e-28 60.32293
27316     8.280668e-28 60.25544
6406      8.529630e-28 60.18386
2079      8.280668e-28 60.11122
8317      1.242597e-27 59.62395
117156    1.584261e-27 59.40209
2149      1.762254e-27 59.20703
158062    1.861829e-27 59.11295
10643     2.451362e-27 58.77632
11252     2.608465e-27 58.54528
121391    2.843047e-27 58.46417
5782      2.843047e-27 58.43767

This table contains the relevant statistics used in the analysis, such as raw and adjusted p-values. In the case of the later ones, it can be seen some of their values are pretty low, which means that the result is not significant. The number of differentially expressed genes under the 0.01 FDR cuttoff can be checked by applying a filter to this table.

signDEgenes <- rownames(tt)[tt$adj.P.Val < FDRcutoff]
length(signDEgenes)
[1] 9190

The differential expression analysis results in 9190 significantly differentially expressed genes. Those can still be filtered by a log fold change cutoff of 2, with the following chromosome distribution.

tt.sign <- tt[tt$adj.P.Val < FDRcutoff,]
DEgenes <- rownames(tt.sign)[abs(tt.sign$logFC) > 2]

length(DEgenes)
[1] 1455
saveRDS(DEgenes, file.path("results", "DEgenes.rds"))

So we are finally calling differentially expressed to 1455 genes. The chromosome distribution of those genes can be obtained.

tt$chr <- sub( "^chr([X|Y|0-9]+.*)", "\\1", tt$chr, perl = T)
tt$chr <- sub( "(.*)_.*_.*$", "\\1A", tt$chr, perl = T)

#sort(table(tt$chr[tt$adj.P.Val < FDRcutoff]), decreasing = TRUE)
sort(table(tt.sign$chr[abs(tt.sign$logFC) > 2]), decreasing = TRUE)

               chr1               chr19               chr11 
                154                 101                  99 
               chr2               chr17               chr12 
                 90                  86                  85 
               chr9               chr16                chr5 
                 72                  69                  65 
               chr4                chr3               chr10 
                 64                  63                  61 
               chr6                chr7                chrX 
                 59                  59                  59 
              chr20               chr15               chr14 
                 46                  43                  42 
               chr8               chr22               chr13 
                 40                  31                  27 
              chr21               chr18 chr1_KI270766v1_alt 
                 24                  15                   1 
#plot(sort(table(tt$chr[tt$adj.P.Val < FDRcutoff & !grepl(".*[A|v][1]*$", tt$chr, perl=TRUE)]), decreasing = TRUE), main = "DE genes chromosome distribution", ylab = "Number of DE genes", xlab = "Chromosome", cex.axis = 0.7)

plot(sort(table(tt.sign$chr[abs(tt.sign$logFC) > 2 & !grepl(".*[A|v][1]*$", tt$chr, perl=TRUE)]), decreasing = TRUE), main = "DE genes chromosome distribution", ylab = "Number of DE genes", xlab = "Chromosome", cex.axis = 0.7)
Warning in abs(tt.sign$logFC) > 2 & !grepl(".*[A|v][1]*$", tt$chr, perl =
TRUE): longer object length is not a multiple of shorter object length
Chromosome distribution of the differentially expressed genes

Figure 10: Chromosome distribution of the differentially expressed genes

The distribution is shown in Figure 10, and it does not correspond to a distribution based on the chromosome size, as some chromosomes are located higher in the ranking than what would be expected by its size. The chromosome Y does not have any differentially expressed genes.

A list of the top 10 most significant differentially expressed genes can be obtained:

top <- order(fit$lods[, 2], decreasing = TRUE)[1:10]
fit$genes$symbol[top]
 [1] "FAM9A"   "FGFR3"   "UFM1"    "STK17A"  "SIX4"    "ZFR2"    "EXOC6"  
 [8] "MIR4687" "SNORD89" "RBMX"   

3.2 Diagnostics

The distribution of raw p-values and the QQ plot moderated t-statistics, displayed in Figure 11, can be used as a diagnosis of the tests performed.

par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(tt$P.Value, xlab = "Raw P-values", main = "", las = 1)
qqt(fit$t[, 2], df = fit$df.prior + fit$df.residual, main = "", pch = ".", cex = 3)
abline(0, 1, lwd = 2)
Raw p-values distribution and QQ plot of the moderated t-statistics for DE analysis.

Figure 11: Raw p-values distribution and QQ plot of the moderated t-statistics for DE analysis

Under the null hypothesis, which is that most genes are not differentially expressed, the raw p-value distribution should be uniform with a peak at low p-values for the truly DE genes. This coincides with the observed plot.

Regarding the QQ plot, it represents the quantiles of our data sample against the theoretical distribution they should follow. The diagonal represents the null hypothesis. Most genes are off-diagonal, indicating differential expression. The fact tha most of the genes are far from the null hypothesis is a measure of the significance of the results.

The differential expression results are diagnosed with volcano and MA plots, in which the top 10 differentially expressed genes are highlighted as shown in Figure ??.

par(mfrow = c(1, 2), mar = c(4, 5, 3, 2))

volcanoplot(fit, coef = 2, highlight = 10, names = fit$genes$symbol, main = "Volcano Plot")

plot(tt$logFC, -log10(tt$P.Value), xlab="Log2 fold-change", ylab="-log10 P-value", 
     pch=".", cex=2, col=grey(0.75), cex.axis=1, cex.lab=1, las=1)

posx <- tt[tt$adj.P.Val < 0.01, "logFC"] 
posy <- -log10(tt[tt$adj.P.Val < 0.01, "P.Value"]) 
points(posx, posy, pch=".", cex=2, col="black")
Volcano and MA plots for DE analysis without SVA.

Figure 12: Volcano and MA plots for DE analysis without SVA

limma::plotMA(fit, coef = 2, status = rownames(fit$lods) %in% signDEgenes, legend = FALSE,
main = "MA Plot", hl.pch = 46, hl.cex = 4, bg.pch = 46, bg.cex = 3, las = 1)
text(fit$Amean[top], fit$coef[top, 2], fit$genes$symbol[top], cex = 0.5, pos = 4)
Volcano and MA plots for DE analysis without SVA.

Figure 13: Volcano and MA plots for DE analysis without SVA

3.3 Session information

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] ca_ES.UTF-8/ca_ES.UTF-8/ca_ES.UTF-8/C/ca_ES.UTF-8/ca_ES.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] sva_3.30.1                  genefilter_1.64.0          
 [3] mgcv_1.8-28                 nlme_3.1-140               
 [5] geneplotter_1.60.0          annotate_1.60.1            
 [7] XML_3.98-1.20               AnnotationDbi_1.44.0       
 [9] lattice_0.20-38             edgeR_3.24.3               
[11] limma_3.38.3                SummarizedExperiment_1.12.0
[13] DelayedArray_0.8.0          BiocParallel_1.16.6        
[15] matrixStats_0.54.0          Biobase_2.42.0             
[17] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
[19] IRanges_2.16.0              S4Vectors_0.20.1           
[21] BiocGenerics_0.28.0         knitr_1.23                 
[23] BiocStyle_2.10.0           

loaded via a namespace (and not attached):
 [1] locfit_1.5-9.1         xfun_0.7               splines_3.5.3         
 [4] htmltools_0.3.6        yaml_2.2.0             blob_1.1.1            
 [7] survival_2.44-1.1      DBI_1.0.0              bit64_0.9-7           
[10] RColorBrewer_1.1-2     GenomeInfoDbData_1.2.0 stringr_1.4.0         
[13] zlibbioc_1.28.0        codetools_0.2-16       evaluate_0.14         
[16] memoise_1.1.0          highr_0.8              Rcpp_1.0.1            
[19] xtable_1.8-4           BiocManager_1.30.4     XVector_0.22.0        
[22] bit_1.1-14             digest_0.6.19          stringi_1.4.3         
[25] bookdown_0.11          grid_3.5.3             tools_3.5.3           
[28] bitops_1.0-6           magrittr_1.5           RCurl_1.95-4.12       
[31] RSQLite_2.1.1          Matrix_1.2-17          rmarkdown_1.13        
[34] compiler_3.5.3        

3.4 Functional enrichmentc analysis with surogate variable adjustment

3.4.1 Gene set enrichment analysis (GSEA)

In addition to identifying differential expressed genes by adjusting surrogate variables in the differential expression analysis, a gene set enrichment analysis (GSEA) was also completed. This procedure provides more sensitivity in identifying gene expression changes. This analysis assesses how genes are behaving differently between two phenotypic states. The phenotypic states in this study are tumor and normal samples. The analysis calculates an enrichment score for each gene set. This score provides information on the changes in gene expression by inidividual genes in the gene set.

The first step in running a GSEA was to select a collection of gene sets from the MSigDB gene set collections provided by the Broad Institute. The C2 collection contained 3272 computational gene sets made up of cancer-oriented microarray data.

library(GSEABase)
Loading required package: graph

Attaching package: 'graph'
The following object is masked from 'package:XML':

    addNode
geneUniverse <- rownames(se.filt)
length(geneUniverse)
[1] 11872
#c4 <- getGmt("c4.all.v6.2.entrez.gmt")
#length(c4)

library(GSVAdata) 
Loading required package: hgu95a.db
Loading required package: org.Hs.eg.db
data(c2BroadSets) 
c2BroadSets
GeneSetCollection
  names: NAKAMURA_CANCER_MICROENVIRONMENT_UP, NAKAMURA_CANCER_MICROENVIRONMENT_DN, ..., ST_PHOSPHOINOSITIDE_3_KINASE_PATHWAY (3272 total)
  unique identifiers: 5167, 100288400, ..., 57191 (29340 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: BroadCollection (1 total)
length(c2BroadSets)
[1] 3272

After the collection was selected for the GSEA, the next step was to map the identifiers from the 3272 C2 gene sets to the dataset being analyzed.

gsc <- GeneSetCollection(c(c2BroadSets))
#map identifiers
gsc <- mapIdentifiers(gsc, AnnoOrEntrezIdentifier(metadata(se.filt)$annotation))
gsc
GeneSetCollection
  names: NAKAMURA_CANCER_MICROENVIRONMENT_UP, NAKAMURA_CANCER_MICROENVIRONMENT_DN, ..., ST_PHOSPHOINOSITIDE_3_KINASE_PATHWAY (3272 total)
  unique identifiers: 5167, 100288400, ..., 57191 (29340 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: BroadCollection (1 total)

A matrix was created to check if the identifiers were properly mapped, or matched to the corresponding identifier in the dataset being analyzed.

Im <- incidence(gsc)
dim(Im)
[1]  3272 29340
Im[1:2, 1:10]
                                    5167 100288400 338328 388 10631 440387
NAKAMURA_CANCER_MICROENVIRONMENT_UP    1         1      1   1     1      1
NAKAMURA_CANCER_MICROENVIRONMENT_DN    0         0      0   0     0      0
                                    54910 10136 3630 6662
NAKAMURA_CANCER_MICROENVIRONMENT_UP     1     1    1    1
NAKAMURA_CANCER_MICROENVIRONMENT_DN     0     0    0    0
Im <- Im[, colnames(Im) %in% rownames(se.filt)]
dim(Im)
[1] 3272 9963

Since we are only interested in the genes from the data set being analyzed, all genes that were not a part of this gene set, denoted by a “0”, were discarded.

se.filt <- se.filt[colnames(Im), ]
dge.filt <- dge.filt[colnames(Im), ]

This left 9963 genes to be analyzed among 100 gene sets from the collection. With the data ready, a SVA was completed without calling any gene differentially expressed. The mean-variance was be the same from the previous differential expression analysis so the plot was not displayed.

patientid <- substr(colnames(se.filt), 9, 12)
mod <- model.matrix(~type + patientid, data = colData(se.filt))
mod0 <- model.matrix(~ patientid, data = colData(se.filt))
sv <- sva(assays(se.filt)$logCPM, mod, mod0)
Number of significant surrogate variables is:  11 
Iteration (out of 5 ):1  2  3  4  5  
# Add surrogates to the model
len <- length(colnames(mod))
mod <- cbind(mod, sv$sv)
colnames(mod) <- c(colnames(mod)[1:len], paste0("SV", 1:sv$n))

v <- voom(dge.filt, mod)
fit <- lmFit(v, mod)
fit <- eBayes(fit)
tt <- topTable(fit, coef = 2, n = Inf)

With the t-statistics avaialble, the z-scores were then calculated to see if a shift in gene expression within a gene set was present. First, a filter was set that required the gene sets to have a minimum size of 5 genes.

Im <- Im[rowSums(Im) >= 5, ]
dim(Im)
[1] 2760 9963

The t-statistics were then stored in a vector in order of the incidence matrix. The z-score was then calculated and sorted by the absolute score. The z-score of 5 indicated that this gene set was about 5 units above the mean. A low z-score suggested the genes in the gene set were not differentially expressed.

tGSgenes <- tt[match(colnames(Im), rownames(tt)), "t"]
length(tGSgenes)
[1] 9963
head(tGSgenes)
[1]   6.684433   5.595676   2.657648 -13.938881   3.125743   6.416383
zS <- sqrt(rowSums(Im)) * (as.vector(Im %*% tGSgenes)/rowSums(Im))
length(zS)
[1] 2760
head(zS)
 NAKAMURA_CANCER_MICROENVIRONMENT_UP  NAKAMURA_CANCER_MICROENVIRONMENT_DN 
                           -2.971891                            -3.972357 
WEST_ADRENOCORTICAL_TUMOR_MARKERS_UP WEST_ADRENOCORTICAL_TUMOR_MARKERS_DN 
                           -3.271506                            -8.729491 
                   WINTER_HYPOXIA_UP                    WINTER_HYPOXIA_DN 
                            4.224154                             9.023336 
rnkGS <- sort(abs(zS), decreasing = TRUE)
head(rnkGS)
  WHITEFORD_PEDIATRIC_CANCER_MARKERS GEORGES_TARGETS_OF_MIR192_AND_MIR215 
                            44.62003                             41.81863 
                     KEGG_CELL_CYCLE             PUJANA_XPRSS_INT_NETWORK 
                            40.98101                             40.04934 
    DODD_NASOPHARYNGEAL_CARCINOMA_DN          REACTOME_CELL_CYCLE_MITOTIC 
                            40.02843                             38.76168 

A one sample z-test was calculated and a conservative multiple testing adjustment was administered to see if any gene sets were candidates for differentially expressed genes. The reason for completing a multiple testing adjustment was due to gene set overlaps.

pv <- pmin(pnorm(zS), 1 - pnorm(zS))
pvadj <- p.adjust(pv, method = "fdr")
DEgs <- names(pvadj)[which(pvadj < 0.01)]
length(DEgs)
[1] 2339

As shown, 2339 gene sets were differentially expressed.

3.4.2 Gene set variation analysis (GSVA)

In addition to the GSEA, a gene set variation analysis (GSVA) was performed. This differs from the standard GSEA by utilizing a gene set by sample matrix rather than a gene by sample matrix. This difference allowed for the pathway enrichment for each individual sample to be analyzed.

In order to do this, a matrix was created with the number of gene sets and an enrichment score that indicates sample-wise gene-level summaries of expression. Since gene sets with a small amount of genes don’t provide much information, a filter was set to remove those gene sets with 5 or less genes.

library(GSVA)
GSexpr <- gsva(assays(se.filt)$logCPM, gsc, min.sz=5, max.sz=300, verbose=FALSE)
dim(GSexpr)
[1] 2698  100

With the data ready, a SVA was completed without calling any gene differentially expressed.

mod <- model.matrix(~se.filt$type + patientid, data = colData(se.filt))
mod0 <- model.matrix(~ patientid, data = colData(se.filt))
svaobj <- sva(GSexpr, mod, mod0)
Number of significant surrogate variables is:  11 
Iteration (out of 5 ):1  2  3  4  5  
modSVs <- cbind(mod, svaobj$sv)

corfit <- duplicateCorrelation(GSexpr, modSVs, block = se.filt$cellline)
fit <- lmFit(GSexpr, modSVs)
fit <- eBayes(fit)
tt <- topTable(fit, coef = 2, n = Inf)
DEgs <- rownames(tt[tt$adj.P.Val < 0.01, , drop = FALSE])
length(DEgs)
[1] 1883

The 1883 gene sets that are differentially expressed at 1% FDR are shown in red in Figure 14.

Volcano and MA plots for GVSA

Figure 14: Volcano and MA plots for GVSA

3.4.3 Gene Ontology Analysis

The next step was to complete a gene ontology analysis. This procedure was completed to see if any differentially expressed genes were associated with certain biological processes defined by the Gene Ontology system.

First, a parameter object was built that specified the gene universe of interest, the set of 1455 DE genes, the ontology, and other information. The ontology selected was “BP”, which matched genes to the Biological Processes associated with the GO Terms.

geneUniverse <- rownames(se.filt)
length(geneUniverse)
[1] 9963
library(GOstats)
Loading required package: Category
Loading required package: Matrix

Attaching package: 'Matrix'
The following object is masked from 'package:S4Vectors':

    expand

Attaching package: 'GOstats'
The following object is masked from 'package:AnnotationDbi':

    makeGOGraph
params <- new("GOHyperGParams", geneIds=DEgenes, universeGeneIds=geneUniverse,
            annotation="org.Hs.eg.db", ontology="BP",
            pvalueCutoff=0.01, testDirection="over")
Warning in makeValidParams(.Object): removing geneIds not in universeGeneIds

In the GO analysis, GO terms can sometime overlap. Therefore, a conditional test was chosen. An Odds Ratio of “Inf” indicated that all genes within a gene set were differentially expressed.

conditional(params) <- TRUE
hgOverCond <- hyperGTest(params)
hgOverCond
Gene to GO BP Conditional test for over-representation 
7579 GO BP ids tested (47 have p < 0.01)
Selected gene set size: 1127 
    Gene universe size: 8939 
    Annotation package: org.Hs.eg 
goresults <- summary(hgOverCond)
head(goresults)
      GOBPID       Pvalue OddsRatio   ExpCount Count Size
1 GO:0002011 0.0003377497  4.519555  3.5301488    11   28
2 GO:0007189 0.0003426225  2.749349  8.9514487    20   71
3 GO:0010876 0.0005206152  1.889882 24.4588880    41  194
4 GO:0007187 0.0009287784  2.078975 16.0117463    29  127
5 GO:0006631 0.0010427968  1.858127 22.9459671    38  182
6 GO:0007172 0.0011310055 27.821906  0.6303837     4    5
                                                                                         Term
1                                                        morphogenesis of an epithelial sheet
2                   adenylate cyclase-activating G protein-coupled receptor signaling pathway
3                                                                          lipid localization
4 G protein-coupled receptor signaling pathway, coupled to cyclic nucleotide second messenger
5                                                                fatty acid metabolic process
6                                                                     signal complex assembly

Since signficiantly enriched gene sets formed by a few genes are not very reliable, a filter was added so those sets with less than or equal to 5 genes were removed. The table was than ordered by Odds Ratio.

goresults <- goresults[goresults$Size >= 5 & goresults$Count >= 5, ]
goresults <- goresults[order(goresults$OddsRatio, decreasing=TRUE), ]
goresults
       GOBPID       Pvalue OddsRatio  ExpCount Count Size
23 GO:0045737 0.0046008469  6.958111  1.260767     5   10
26 GO:0043537 0.0048376104  5.221231  1.765074     6   14
11 GO:0006376 0.0019328356  4.647006  2.521535     8   20
37 GO:0032309 0.0072136074  4.640500  1.891151     6   15
38 GO:0048025 0.0072136074  4.640500  1.891151     6   15
1  GO:0002011 0.0003377497  4.519555  3.530149    11   28
22 GO:0019369 0.0044747007  4.432386  2.269381     7   18
17 GO:0003416 0.0027814383  4.288994  2.647612     8   21
45 GO:0010171 0.0086905926  3.749519  2.521535     7   20
9  GO:1990573 0.0012586714  3.656810  4.034456    11   32
27 GO:0008333 0.0057406264  3.301808  3.530149     9   28
28 GO:0043536 0.0057406264  3.301808  3.530149     9   28
29 GO:0031055 0.0058532649  3.031801  4.160532    10   33
24 GO:0034508 0.0046068579  2.951682  4.664839    11   37
47 GO:0006336 0.0091859439  2.788541  4.412686    10   35
2  GO:0007189 0.0003426225  2.749349  8.951449    20   71
10 GO:0006821 0.0018980524  2.530286  8.068912    17   64
43 GO:0098661 0.0074672695  2.277330  7.690681    15   61
21 GO:0035265 0.0042352048  2.178922 10.086139    19   80
4  GO:0007187 0.0009287784  2.078975 16.011746    29  127
15 GO:0042180 0.0024372379  2.076760 13.238058    24  105
19 GO:0019935 0.0032098814  1.960036 15.003132    26  119
3  GO:0010876 0.0005206152  1.889882 24.458888    41  194
5  GO:0006631 0.0010427968  1.858127 22.945967    38  182
16 GO:0030198 0.0026756023  1.757798 23.324197    37  185
20 GO:0051222 0.0032660275  1.669765 27.610807    42  219
44 GO:0051047 0.0083337105  1.573876 28.241190    41  224
46 GO:0042391 0.0088202396  1.549489 30.006265    43  238
30 GO:0006820 0.0059600165  1.487018 41.983555    58  333
31 GO:0051301 0.0062639237  1.462798 45.513704    62  361
25 GO:0040007 0.0047313203  1.400415 64.929522    85  515
                                                                                          Term
23            positive regulation of cyclin-dependent protein serine/threonine kinase activity
26                              negative regulation of blood vessel endothelial cell migration
11                                                                  mRNA splice site selection
37                                                                         icosanoid secretion
38                                       negative regulation of mRNA splicing, via spliceosome
1                                                         morphogenesis of an epithelial sheet
22                                                          arachidonic acid metabolic process
17                                                                    endochondral bone growth
45                                                                          body morphogenesis
9                                                  potassium ion import across plasma membrane
27                                                              endosome to lysosome transport
28                              positive regulation of blood vessel endothelial cell migration
29                                                          chromatin remodeling at centromere
24                                                                 centromere complex assembly
47                                             DNA replication-independent nucleosome assembly
2                    adenylate cyclase-activating G protein-coupled receptor signaling pathway
10                                                                          chloride transport
43                                                     inorganic anion transmembrane transport
21                                                                                organ growth
4  G protein-coupled receptor signaling pathway, coupled to cyclic nucleotide second messenger
15                                                           cellular ketone metabolic process
19                                                        cyclic-nucleotide-mediated signaling
3                                                                           lipid localization
5                                                                 fatty acid metabolic process
16                                                           extracellular matrix organization
20                                                    positive regulation of protein transport
44                                                            positive regulation of secretion
46                                                            regulation of membrane potential
30                                                                             anion transport
31                                                                               cell division
25                                                                                      growth

3.5 Session information

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] ca_ES.UTF-8/ca_ES.UTF-8/ca_ES.UTF-8/C/ca_ES.UTF-8/ca_ES.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] GO.db_3.7.0                 GOstats_2.48.0             
 [3] Category_2.48.1             Matrix_1.2-17              
 [5] GSVA_1.30.0                 GSVAdata_1.18.0            
 [7] hgu95a.db_3.2.3             org.Hs.eg.db_3.7.0         
 [9] GSEABase_1.44.0             graph_1.60.0               
[11] sva_3.30.1                  genefilter_1.64.0          
[13] mgcv_1.8-28                 nlme_3.1-140               
[15] geneplotter_1.60.0          annotate_1.60.1            
[17] XML_3.98-1.20               AnnotationDbi_1.44.0       
[19] lattice_0.20-38             edgeR_3.24.3               
[21] limma_3.38.3                SummarizedExperiment_1.12.0
[23] DelayedArray_0.8.0          BiocParallel_1.16.6        
[25] matrixStats_0.54.0          Biobase_2.42.0             
[27] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
[29] IRanges_2.16.0              S4Vectors_0.20.1           
[31] BiocGenerics_0.28.0         knitr_1.23                 
[33] BiocStyle_2.10.0           

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.5.3          shiny_1.3.2           
 [4] statmod_1.4.32         BiocManager_1.30.4     highr_0.8             
 [7] RBGL_1.58.2            blob_1.1.1             GenomeInfoDbData_1.2.0
[10] yaml_2.2.0             RSQLite_2.1.1          digest_0.6.19         
[13] RColorBrewer_1.1-2     promises_1.0.1         XVector_0.22.0        
[16] htmltools_0.3.6        httpuv_1.5.1           pkgconfig_2.0.2       
[19] bookdown_0.11          zlibbioc_1.28.0        xtable_1.8-4          
[22] later_0.8.0            survival_2.44-1.1      magrittr_1.5          
[25] mime_0.7               memoise_1.1.0          evaluate_0.14         
[28] tools_3.5.3            stringr_1.4.0          locfit_1.5-9.1        
[31] compiler_3.5.3         grid_3.5.3             RCurl_1.95-4.12       
[34] AnnotationForge_1.24.0 bitops_1.0-6           rmarkdown_1.13        
[37] codetools_0.2-16       DBI_1.0.0              R6_2.4.0              
[40] bit_1.1-14             shinythemes_1.1.2      Rgraphviz_2.26.0      
[43] stringi_1.4.3          Rcpp_1.0.1             xfun_0.7              

4 References